Abstract

Social media is becoming an increasingly important
source of information to complement traditional
pharmacovigilance methods. In order to identify signals
of potential adverse drug reactions, it is necessary to
first identify medical concepts in the social media text.
Most of the existing studies use dictionary-based
methods which are not evaluated independently from the
overall signal detection task.

We compare different approaches to automatically
identify and normalise medical concepts in consumer
reviews in medical forums. Specifically, we implement
several dictionary-based methods popular in the relevant
literature, as well as a method we suggest based on
a state-of-the-art machine learning method for entity
recognition. MetaMap, a popular biomedical concept
extraction tool, is used as a baseline. Our evaluations
were performed in a controlled setting on a common
corpus which is a collection of medical forum posts
annotated with concepts and linked to controlled
vocabularies such as MedDRA and SNOMED CT.

To our knowledge, our study is the first to systematically
examine the effect of popular concept extraction
methods in the area of signal detection for adverse
reactions. We show that the choice of algorithm
or controlled vocabulary has a significant impact on
concept extraction, which will impact the overall signal
detection process. We also show that our proposed
machine learning approach significantly outperforms all
the other methods in identification of both adverse
reactions and drugs, even when trained with a relatively
small set of annotated text.


1 Introduction 


Adverse Drug Reactions (ADRs), also known as drug
side effects, are a major concern for public health,
costing health care systems worldwide millions of
dollars [Hug et al., 2012; Ehsani et al., 2006; Roughead
and Semple, 2009]. An ADR is an injury caused
by a medication that is administered at the recommended
dosage, for recommended symptoms. The traditional
pharmacovigilance methods have shown limitations
that have prompted the search for alternative
sources of information that might help identify signals
of potential ADRs. These signals can then be used
to select which cases warrant a more thorough review.
These assessments are performed by regulatory agencies
such as the Food and Drug Administration (FDA)
in the United States and the Therapeutic Goods
Administration (TGA) in Australia, and intend to establish
a causal effect between the drug and the ADR. If a
causal link is found, and depending on the severity of
the ADR, it will be added to the drug’s label or it might
even trigger a removal of the drug from the market if
it is considered life-threatening.

Social media has been identified as a potential source
of information that could be used to find signals of
potential ADRs [Benton et al., 2011]. A public opinion
survey conducted by The Pew Research Center’s
Global Attitudes Project in 2009 [Fox and Jones, 2009]
showed that 61% of American adults looked for health
information online, 41% had read about someone else’s
experience, and 30% were actively creating new content.
These numbers give a strong indication about
the growing importance of social media in the area of
health.

Several attempts at extracting ADR signals from social
media have shown promising results [Benton et al., 2011;
Leaman et al., 2010; Yang et al., 2012; Liu and Chen,
2013]. However, all of these techniques first need
to identify concepts of interest, such as mentions of
adverse effects, in the social media text, which is
unstructured and noisy. Most current approaches use
medical concept identification techniques based on
dictionary lookup, but do not evaluate this step
independently [Metke-Jimenez et al., 2014]. This step is critical
because errors can affect the subsequent stages of the
signal detection process. The problem of concept
identification and normalisation (linking the identified
concepts to their corresponding concepts in controlled
vocabularies) has been studied extensively in the
context of Natural Language Processing (NLP) of clinical
notes, but these techniques have not been used in the
context of ADR mining from social media. The main
reasons for this are the lack of publicly available
corpora with gold-standard annotations and the difficulty
in identifying specific concepts, such as adverse
reactions, in lay language. The noisy nature of social
media text makes concept identification a hard problem.
For example, in the corpus used in this work the
drug Lipitor is spelled in seven different ways (Lipitor,
Liptor, Lipitol, Lipiltor, Liptior, Lipior and Litpitor)
and it is also written using different case combinations
(e.g. Lipitor, LIPITOR and lipitor).
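To make the spelling problem concrete, the following is a minimal sketch of how the Lipitor variants listed above could be folded onto a single lexicon entry. It is an illustration only and is not one of the methods evaluated in this paper: the function names, the two-entry lexicon and the 0.8 similarity cutoff are assumptions, and the similarity measure is the one provided by Python's standard `difflib` module.

```python
import difflib
from typing import List, Optional

# Spellings of Lipitor reported in the corpus (taken from the text above).
VARIANTS = ["Lipitor", "Liptor", "Lipitol", "Lipiltor",
            "Liptior", "Lipior", "Litpitor"]


def normalise_case(token: str) -> str:
    """Fold case so that Lipitor, LIPITOR and lipitor compare equal."""
    return token.casefold()


def closest_drug(token: str, lexicon: List[str],
                 cutoff: float = 0.8) -> Optional[str]:
    """Return the lexicon entry most similar to `token`.

    Returns None when no entry reaches the (assumed) similarity cutoff.
    """
    folded = {normalise_case(name): name for name in lexicon}
    hits = difflib.get_close_matches(
        normalise_case(token), list(folded), n=1, cutoff=cutoff)
    return folded[hits[0]] if hits else None


if __name__ == "__main__":
    lexicon = ["Lipitor", "Voltaren"]  # hypothetical two-entry drug lexicon
    for variant in VARIANTS + ["LIPITOR", "lipitor"]:
        print(variant, "->", closest_drug(variant, lexicon))
```

A fixed cutoff trades recall for precision: lowering it catches more misspellings but also starts to confuse distinct drug names, which is one reason such a step would need its own evaluation.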

The contributions of this paper are twofold: 


1. Several existing concept identification and
normalisation methods are evaluated in the domain of
adverse effect discovery from social media, including
the dictionary-based methods applied in the recent
ADR mining literature, as well as a state-of-the-art
machine learning method that has been used
successfully in similar tasks in other domains; and,

2. A variety of evaluation metrics are used, and in
some cases proposed, to compare the effectiveness
of different methods, including the statistical
analysis of performance improvement.

2 Background 

This section starts by clarifying the terminology that is 
used throughout the paper. Then, a brief introduction 
to the controlled vocabularies that we refer to in the 
literature and our experiments is given. 

2.1 Terminology Clarification 

Some of the terminology used in this paper has been
used inconsistently in the literature. We use the terms
concept and entity in free text interchangeably. Concept
recognition and concept identification are also
treated as synonyms. We refer to concept extraction
as a process of concept identification followed by
normalisation / mapping to controlled vocabularies.

We also note that the methods we refer to as 
dictionary-based are also known as lexicon-based or 
lexicon lookup. 

2.2 Controlled vocabularies 

Controlled vocabularies are typically used to identify
medical concepts in free text. This section provides
background on controlled vocabularies that are commonly
used in the relevant literature. Note that some
of these resources are really taxonomies or ontologies,
but are used as controlled vocabularies in the context
of this paper.

MedDRA The Medical Dictionary for Regulatory
Activities¹ is a thesaurus of ADRs used internationally
by regulatory agencies and pharmaceutical companies
to consistently code ADR reports.

Before MedDRA, the FDA had developed the Coding 
Symbols for a Thesaurus of Adverse Reaction Terms 
(COSTART), which is now obsolete. 

CHV The Consumer Health Vocabulary² provides a
list of health terms used by lay people, including
frequent misspellings. For example, it links both lung
tumor and lung tumour to lung neoplasm.

SNOMED CT The Systematized Nomenclature of
Medicine - Clinical Terms³ is a large ontology of medical
concepts that has been recommended as the reference
terminology for clinical information systems
in countries such as Australia, the United Kingdom,
Canada, and the United States [Lee et al., 2013]. It
includes formal definitions, codes, terms, and synonyms
for more than 300,000 medical concepts. Several versions
of the ontology exist, including an international
version and several country-specific versions that extend
the international version to add local content and
synonyms.

¹ http://www.meddra.org/

² http://www.consumerhealthvocab.org/

³ http://www.ihtsdo.org/snomed-ct/


UMLS The Unified Medical Language System⁴ is a
collection of several health and biomedical controlled
vocabularies, including MedDRA, SNOMED CT, and
CHV. Terms in the controlled vocabularies are mapped
to UMLS concepts. It also provides a semantic network
that contains semantic types linked to each other
through semantic relationships. Each UMLS concept
is assigned one or more semantic types.


AMT The Australian Medicines Terminology⁵ is an
extension of the Australian version of SNOMED CT
that provides unique codes and accurate, standardised
names that unambiguously identify all commonly used
medicines in Australia.


3 Related Work 


Although there is a large body of literature on generic
information extraction from formal text such as news,
and from social media, especially Twitter, there is limited
work on the specific area of ADR detection. ADR
signal detection has been studied in spontaneous
reporting systems [Bate and Evans, 2009], medical case
reports [Gurulingappa et al., 2012], and Electronic
Health Records [Friedman, 2009]. A comprehensive
survey of text and data mining techniques used for
ADR signal detection from several sources, including
social media, can be found in [Karimi et al., 2015b].

Below, we review the most relevant ADR extraction
techniques used in social media. We also review the
state of the art in medical concept identification and
normalisation in the context of clinical notes.


3.1 ADR Extraction from Social Media 


Medical forums are online sites where people discuss 
their health concerns and share their experience with 
other patients or health professionals. Actively mining 
these forums could potentially reveal safety concerns 
regarding medications before regulators discover them 
through more passive methods via official channels such 
as health professionals. 

Leaman et al. [2010] proposed to mine patients’ comments
on health related web sites, specifically
DailyStrength⁶, to find mentions of adverse drug events.
They used a lexicon that combines COSTART and a
few other sources to extract ADR-related information
from text. In a preprocessing step, they break the
posts into sentences, tokenise the sentences, run a
Part-of-Speech (POS) tagger, remove stopwords, and stem
the words using the Porter stemmer. Using a sliding
window approach, they match the lexicon entries with
the preprocessed text and then evaluate the matches
against the manually annotated text. Their data was
annotated with ADRs, beneficial effects, indications,
and others. We evaluate a similar method without taking
into account the similarity between the tokens.

⁴ http://www.nlm.nih.gov/research/umls

⁵ http://www.nehta.gov.au/our-work/clinical-terminology/australian-medicines-terminology

⁶ http://www.dailystrength.org/
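The preprocessing and sliding window steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of Leaman et al.: sentence splitting and POS tagging are omitted, the stopword list is a tiny made-up sample, `crude_stem` is a rough stand-in for the Porter stemmer, and all function names are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Tiny illustrative stopword list (an assumption, not the list used in the literature).
STOPWORDS = {"a", "an", "the", "and", "of", "my", "i", "in", "to", "it", "has"}


def crude_stem(token: str) -> str:
    """Rough stand-in for the Porter stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> List[str]:
    """Lowercase, tokenise, drop stopwords, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


def sliding_window_matches(text: str, lexicon: List[str],
                           max_window: int = 3) -> List[Tuple[int, int, str]]:
    """Slide windows of 1..max_window tokens over the preprocessed text.

    Returns (start, end, lexicon_entry) triples, where start and end index
    the preprocessed token list (end exclusive), not the raw characters.
    """
    tokens = preprocess(text)
    entries: Dict[Tuple[str, ...], str] = {
        tuple(preprocess(entry)): entry for entry in lexicon}
    matches = []
    for size in range(max_window, 0, -1):
        for start in range(len(tokens) - size + 1):
            window = tuple(tokens[start:start + size])
            if window in entries:
                matches.append((start, start + size, entries[window]))
    return sorted(matches)


if __name__ == "__main__":
    post = "Severe stomach cramps and burping after taking it"
    print(sliding_window_matches(post, ["stomach cramp", "burping", "headache"]))
```

Because both the post and the lexicon entries pass through the same preprocessing, "cramps" in the post matches the entry "stomach cramp"; the match is exact on the stemmed tokens, with no similarity score between tokens.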

Chee et al. [2011] applied Naive Bayes and Support
Vector Machine classifiers to identify drugs that could
potentially become part of the watchlist of the US
regulatory agency, the FDA. They used patients’ posts on
Health and Wellness Yahoo! Groups. The text was processed
to generate features for the classifiers. They had
two sets of features: all the words from the posts, and
only those words that matched their controlled vocabulary
(that included MedDRA and a list of diseases).
Misspellings were not fixed.

Benton et al. [2011] extracted potential ADRs from
a number of different breast cancer forums (such as
breastcancer.org) by using frequency counts of terms
from a controlled vocabulary in their corpus and then
using association rule mining to establish the relationship
between the matching terms. Association rule
mining is a data mining approach popular for mining
ADRs from regulatory and administrative databases.
The method by Benton et al. was an advancement on
the approach used by Leaman et al. (2010), as they did
not stop at just the extraction of interesting concepts,
but also proposed a method to establish a relationship
between the extracted terms.

Yang et al. [2012] studied signal detection from a
medical forum called MedHelp using data mining approaches.
They extended the existing association rule
mining algorithms by adding “interestingness” and
“impressiveness” metrics. They had to find mentions
of ADRs in the text to process the forum data and calculate
confidence and leverage. To do this, they used a
sliding window and the CHV as a controlled vocabulary
to match the terms.

None of these studies [Chee et al., 2011; Benton et al.,
2011; Yang et al., 2012] evaluated the information
extraction step on its own, which we will cover in this
study.

Liu and Chen [2013] implemented a system called
AZDrugMiner. Data was collected using a crawler
and was then post-processed by removing any HTML
tags and extracting text for further analysis. They
then used an NLP tool called OpenNLP to break the
text into sentences. To find the relevant parts of each
sentence, for example mentions of a drug, they used
MetaMap [Aronson, 2001], which maps text to UMLS
concepts. After this stage, they extracted relations using
co-occurrence analysis. They also used a tool called
NegEx [Chapman et al., 2001] to identify negations in
the text. This work uses MetaMap for the concept
extraction step which we use as a baseline in our work.

Sampathkumar et al. [2014] proposed to use a machine
learning approach, Hidden Markov Model, to
extract relationships between drugs and their side effects
in a medical forum called medications.com. For
their concept recognition module they relied on a
dictionary-lookup method. They created a dictionary of
drug names from the drugs.com website and a dictionary
of adverse drug effects from SIDER, a resource
that lists side effect terminology. The concept recognition
step was not evaluated on its own.

Metke-Jimenez et al. [2014] empirically evaluated a
lexicon-based concept identification mechanism, similar
to the ones reported in the existing literature,
and tested different combinations of preprocessing techniques
and controlled vocabularies using a manually
annotated data set of medical forum posts from the
AskaPatient website. The results showed that the best
performing controlled vocabulary was the CHV, but
the overall performance was quite poor. Our work has a
similar goal but differs because we compare more methods,
including a baseline method and a state of the art
machine learning method. Also, the data set we use is
larger and contains a wider variety of posts. The task
we evaluate also includes concept normalisation which
requires mapping the spans that were identified to a
corresponding concept in a controlled vocabulary. Finally,
we use more comprehensive metrics to compare
the relative performance of the different techniques under
evaluation.

3.2 Medical Concept Identification and
Normalisation

The problem of medical concept identification and
normalisation has been extensively studied by the clinical
text mining community. Early work often relied on
pattern matching rules, e.g., [Evans et al., 1996],
or used MetaMap as a tool to identify concepts using
the UMLS Metathesaurus, e.g., [Jimeno et al., 2008].

More recently, several open challenges have bolstered
the research in this area, including the i2b2
Medication Extraction Challenge [Uzuner et al., 2010],
ShARe/CLEF eHealth Evaluation Lab 2013 and
SemEval-2014.

In 2010, the i2b2 medication extraction challenge
was introduced as an annotation exercise. Participating
teams were given a small number of discharge summaries
(10 per person) to annotate for mentions of medications,
the way these medications were administered
(dosage, duration, frequency, and route), as well as reasons
for taking the medications. To complete this challenge
some participating teams used automated methods
as well as manual reviews. [Mork et al., 2010], for
example, used a combination of dictionary lookup (e.g.,
UMLS, RxTerms, Daily Med) and concept annotation
tools (e.g., MetaMap) to find the concepts.

Task 1 of the ShARe/CLEF eHealth Evaluation Lab
2013 [Pradhan et al., 2013] used the ShARe corpus,
which provides a collection of annotated, de-identified
clinical reports from US intensive care units (version
2.5 of the MIMIC II database⁷).

The task was divided into two parts. The goal of
part A was to identify spans that represent disorders,
defined as any text that can be mapped to a SNOMED
CT concept in the Disorder semantic group. The goal
of Part B was to map these spans to SNOMED CT
codes. Part A was evaluated using precision, recall,
and F-Score (see Section 4.1 for the definition of these
metrics). Evaluation for concept identification was divided
into two categories: strict and relaxed. The strict
version required that the annotations match exactly,
while the relaxed version did not. Part B was evaluated
using accuracy, which was defined as the number
of pre-annotated spans with correctly generated codes
divided by the total number of pre-annotated spans
(note that in this paper this metric is referred to as
effectiveness; see Section 4.2). The strict version considered
the total number of pre-annotated spans to be
the total number of entities in the gold standard. The
relaxed version considered the total to be the number
of strictly correct spans generated by the system in part
A.

⁷ http://mimic.physionet.org

Table 2: Best reported scores in part B (SNOMED CT
mapping) of ShARe/CLEF 2013 Task 1 and SemEval
2014 Task 7.

Task                            Strictness   Accuracy
CLEF [Leaman et al., 2013]      Strict       [illegible]
SemEval [Zhang et al., 2014]    Strict       0.741
CLEF [Leaman et al., 2013]      Relaxed      [illegible]
SemEval [Zhang et al., 2014]    Relaxed      [illegible]

All the best performing systems for part A (concept
identification) used machine learning algorithms, including
Conditional Random Fields (CRF) and Structural
Support Vector Machines (SSVM) [Tang et al.,
2013; Leaman et al., 2013; Gung, 2013]. Our work extends
the definition of this task by increasing the
number of concepts to be identified, and tailoring it to
the adverse effect signal detection area. We also target
forum data, which raises its own specific challenges due
to language irregularities.

Task 7 at SemEval-2014 was a continuation of the
CLEF 2013 task, but used more data for training and
introduced a new test set [Pradhan et al., 2014]. The
best scores obtained in these challenges are shown in
Tables 1 and 2.

The main difference between the implementations
submitted to these challenges was the selection
of features used as input to the machine learning
algorithms. Table 3 shows some of the common
features used by different systems. Note that not all
systems reported their features.

Apart from these open challenges, a recent study
by Ramesh et al. [2014] also proposed using supervised
machine learning, including Naive Bayes, support vector
machines, and conditional random fields, to annotate
ADR reports collected by the FDA for drugs and
adverse effects, and then reviewing the annotation using
human annotators. The main goal of this study,
however, was developing an annotated corpus of drug
reviews, and it therefore differs from our study in terms of
the evaluations involved.

There is another line of studies that are also referred
to as normalisation, or more specifically social text
normalisation, in the natural language processing domain.
Studies such as [Hassan and Menezes, 2013; Ling et al.,
2013; Chrupala, 2014] propose algorithms to restore the
standard or formal form of non-standard words that appear
frequently in social media text. For example, chk
and abt, two abbreviations common in Twitter, may be
normalised to check and about. In our work, we refer
to normalisation as mapping specific medical concepts
to biomedical ontologies and controlled vocabularies,
which is different to transforming a given free text
abbreviation to its formal equivalent.

Table 3: Some of the common features used in machine
learning approaches to disorder span identification.

Feature               Description
Bag of words          The words that surround each token.
POS tags              The part of speech tag assigned to the token.
Word shape            Indicates the shape of the token, for example, if
                      the token is composed of only lower case letters,
                      upper case letters, or a combination.
Type of notes         The corpus includes different types of clinical
                      notes (e.g. discharge summaries, radiology
                      reports, etc.). This feature indicates the type of
                      note that contains the token.
Section information   Indicates the section of the note that contains
                      the token (e.g. Past Medical History).
Semantic mapping      The concept assigned to a token by an existing
                      tool, such as MetaMap or cTAKES.


4 Problem Formulation 

Our goal is to evaluate the concept identification and
normalisation step independently from the overall task
of signal detection in free text. We restrict our task
to focus on social media, specifically medical forums.
Apart from the challenges that this data type raises,
such as dealing with misspellings and colloquial language,
we also aim to evaluate concept identification
techniques that are widely used in the literature to determine
how well they perform in comparison to each
other. Since in the specific application of ADR signal
detection, linking the concepts to a standard vocabulary
provides another level of knowledge that can be
utilised, we also evaluate this step, which we call
normalisation. In this section, we formally describe these
two parts of the task, concept identification and concept
normalisation, and their evaluation metrics.

4.1 Concept Identification 

Concept identification consists of identifying spans of
text that represent medical concepts, specifically drugs
and ADRs. The latter is more challenging because the
same medical concept can be considered an ADR, a
symptom, or a disease, depending on the context in
which it is used. We avoid dealing with this complexity
and therefore define the goal of the task to be the
identification of any span of text that could represent
a drug or an ADR, disregarding the context.

Table 1: Best reported scores in part A (concept identification) of ShARe/CLEF 2013 Task 1 and SemEval
2014 Task 7.

Task                            Strictness   Precision   Recall   F-Score
CLEF [Tang et al., 2013]        Strict       0.800       0.706    0.750
SemEval [Zhang et al., 2014]    Strict       0.843       0.786    0.813
CLEF [Tang et al., 2013]        Relaxed      0.925       0.827    0.873
SemEval [Zhang et al., 2014]    Relaxed      0.916       0.907    0.911

Spans can be continuous or discontinuous. Spans
cannot overlap each other, except when several discontinuous
spans share a common fragment. Figure 1
shows some examples of these different span types. In
the presence of potentially overlapping spans, the annotators
were asked to select the longest one.
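One way to represent these span types in code is sketched below. The class and function names are hypothetical, offsets are assumed to be character positions with an exclusive end, and the overlap rule is our reading of the sentence above (overlap is tolerated only when the overlapping parts are identical shared fragments and at least one of the spans is discontinuous); the corpus annotation format itself may differ.

```python
from dataclasses import dataclass
from typing import Tuple

Fragment = Tuple[int, int]  # (start, end) character offsets, end exclusive


@dataclass(frozen=True)
class Span:
    label: str                        # e.g. "Drug" or "ADR"
    fragments: Tuple[Fragment, ...]   # one fragment: continuous; several: discontinuous

    @property
    def is_continuous(self) -> bool:
        return len(self.fragments) == 1


def fragments_overlap(a: Fragment, b: Fragment) -> bool:
    """True when two half-open character ranges share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def overlap_allowed(x: Span, y: Span) -> bool:
    """Check the (assumed) overlap rule for a pair of spans.

    Disjoint spans are always fine. Overlapping spans are accepted only if
    every overlapping fragment pair is an identical shared fragment and the
    two spans are not both continuous.
    """
    overlapping = [(fx, fy) for fx in x.fragments for fy in y.fragments
                   if fragments_overlap(fx, fy)]
    if not overlapping:
        return True
    if x.is_continuous and y.is_continuous:
        return False
    return all(fx == fy for fx, fy in overlapping)


if __name__ == "__main__":
    text = "it has left me feeling exausted, and depressed."

    def frag(word: str) -> Fragment:
        start = text.index(word)
        return (start, start + len(word))

    first = Span("ADR", (frag("feeling"), frag("exausted")))
    second = Span("ADR", (frag("feeling"), frag("depressed")))
    print(overlap_allowed(first, second))  # the two spans share the fragment "feeling"
```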

Concept identification can be framed as a binary
classification problem and evaluated using precision,
recall, and F-score, as defined below:

\[
\text{Precision} = \frac{n_{TP}}{n_{TP} + n_{FP}}, \qquad
\text{Recall} = \frac{n_{TP}}{n_{TP} + n_{FN}}, \qquad
\text{F-Score} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}},
\]

where $n_{TP}$ is the number of matching spans, $n_{FP}$ is
the number of spans reported by the system that are
not part of the gold standard, and $n_{FN}$ is the number
of spans in the gold standard that were not reported
by the system. In the strict version of the evaluation,
the spans are required to match exactly. In the relaxed
version the spans only need to overlap to be considered
a positive match. In this case, however, only one-to-one
mappings are allowed, i.e. a span can only be mapped
to one other span.

These metrics do not consider the correct classification
of negative examples [Sokolova and Lapalme,
2009]. In order to measure the overall effectiveness
of each system, we propose to use accuracy, which is
defined as

\[
\text{Accuracy} = \frac{n_{TP} + n_{TN}}{n_{TP} + n_{TN} + n_{FP} + n_{FN}},
\]

where $n_{TN}$ is the number of spans that are not in the
gold standard that were not generated by the implementation
under evaluation. Notice that in this task,
any span that is not part of the gold standard is considered
an incorrect span; negative examples are not
explicitly enumerated. Given that the total number
of negative examples is extremely large and that we
are interested in comparing several methods, the set of
negative examples is defined as all the spans that are
created by all the methods under evaluation that are
not part of the gold standard.

4.2 Concept normalisation

The normalisation step takes the spans that were identified
in the identification step and maps them to a
concept in an ontology or controlled vocabulary. For
example, the three medication mentions Pethidine,
Demerol, and Meperidine are all mapped to one concept,
Pethidine. This step helps to find the links to concepts that
are semantically similar or identical.

In our setting, ADR spans are mapped to the Clinical
Finding hierarchy of SNOMED CT and in the case of
drugs to a representative concept in AMT.

Concept normalisation is often evaluated using a
metric referred to as accuracy. To avoid confusion with
the proposed metric for the first part of the task, we
refer to this metric as effectiveness, which is defined as

\[
\text{Effectiveness}_{strict} = \frac{n_{TP} \cap n_{correct}}{t_g}, \qquad
\text{Effectiveness}_{relaxed} = \frac{n_{TP} \cap n_{correct}}{n_{TP}},
\]

where $n_{TP}$ is the number of spans that match the gold
standard exactly, $n_{correct}$ is the number of spans that
were mapped to the correct concept in the corresponding
ontology, and $t_g$ is the total number of identified
concepts or spans in the gold standard. Notice that the
relaxed effectiveness metric only considers the spans
that were correctly identified in the concept identification
stage, therefore a system that performs very poorly
overall can still get a very high score on this metric.
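The metrics of Sections 4.1 and 4.2 can be sketched in a few lines of Python. This is an illustrative sketch rather than the evaluation code used in our experiments: spans are simplified to continuous `(start, end)` pairs, the function names are hypothetical, and the relaxed matcher pairs spans greedily in sorted order, which is an assumption because the text only requires the pairing to be one-to-one.

```python
from typing import Dict, Set, Tuple

Span = Tuple[int, int]  # (start, end) offsets, end exclusive; continuous spans only


def prf(n_tp: int, n_fp: int, n_fn: int) -> Tuple[float, float, float]:
    """Precision, recall and F-score from the three span counts."""
    precision = n_tp / (n_tp + n_fp) if n_tp + n_fp else 0.0
    recall = n_tp / (n_tp + n_fn) if n_tp + n_fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


def accuracy(n_tp: int, n_tn: int, n_fp: int, n_fn: int) -> float:
    total = n_tp + n_tn + n_fp + n_fn
    return (n_tp + n_tn) / total if total else 0.0


def strict_counts(gold: Set[Span], system: Set[Span]) -> Tuple[int, int, int]:
    """(n_TP, n_FP, n_FN) when spans must match exactly."""
    n_tp = len(gold & system)
    return n_tp, len(system - gold), len(gold - system)


def relaxed_counts(gold: Set[Span], system: Set[Span]) -> Tuple[int, int, int]:
    """(n_TP, n_FP, n_FN) when overlapping spans count, paired one-to-one (greedy)."""
    unmatched = sorted(gold)
    n_tp = 0
    for s in sorted(system):
        for g in unmatched:
            if s[0] < g[1] and g[0] < s[1]:
                unmatched.remove(g)  # each gold span can be used only once
                n_tp += 1
                break
    return n_tp, len(system) - n_tp, len(gold) - n_tp


def effectiveness(gold: Dict[Span, str],
                  system: Dict[Span, str]) -> Tuple[float, float]:
    """(strict, relaxed) effectiveness; the dicts map each span to a concept id."""
    exact = [s for s in system if s in gold]               # spans counted in n_TP
    correct = [s for s in exact if system[s] == gold[s]]   # n_TP and correctly mapped
    strict = len(correct) / len(gold) if gold else 0.0     # denominator t_g
    relaxed = len(correct) / len(exact) if exact else 0.0  # denominator n_TP
    return strict, relaxed


if __name__ == "__main__":
    gold = {(0, 5): "C1", (10, 15): "C2", (20, 25): "C3", (30, 35): "C4"}
    system = {(0, 5): "C1", (10, 15): "C9", (21, 24): "C3", (40, 45): "C5"}
    print(prf(*strict_counts(set(gold), set(system))))
    print(prf(*relaxed_counts(set(gold), set(system))))
    print(effectiveness(gold, system))
```

In the toy data, the span `(21, 24)` only counts under relaxed identification, and it does not contribute to either effectiveness score because effectiveness is computed over exactly matching spans.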


5 Dataset 

In our experiments, we used a publicly available annotated
corpus called CSIRO Adverse Drug Event Corpus
(Cadec)⁸. This corpus is a collection of medical posts
sourced from the medical forum AskaPatient⁹. The forum
is organised by drug names and allows consumers
to post reviews on the medications that they are consuming
in natural language. Figure 2 shows a sample
from the AskaPatient website on Voltaren. For each
post shown in one row, Cadec only contains two free-text
columns: side effects and comments.

Cadec includes reviews on 12 drugs, a total of 1250
forum posts. These reviews were manually annotated
with a set of tags such as drug name and disease
name, as shown in Table 4. An expert clinical terminologist
then mapped these spans to concepts in MedDRA,
SNOMED CT and AMT. When no corresponding
concept was available in the ontologies to represent
the span, the value concept_less was assigned. A detailed
description of the corpus, including the annotation
guidelines, can be found in [Karimi et al., 2015a].

To develop and evaluate a machine learning approach,
we divided the data into training and testing
sets, using a 70/30 split. Unlike some previous
work such as [Sampathkumar et al., 2014], we do not
use k-fold cross-validation, in order to avoid potential bias that
may be introduced due to the nature of social media
text [Karimi et al., 2015c]. Table 5 shows the number
of documents and span types in each set.

⁸ http://dx.doi.org/10.4225/08/5490FA2E01A90

⁹ http://www.askapatient.com

[Figure 1 example sentences, with their annotation labels:
"Next time I’ll try my luck with Paracetamol." (Drug);
"The pill I took consisted of 50 MG Diclofenac and 200 MG Misoprostol." (Drug);
"... it has left me feeling exausted, and depressed." (ADR)]

Figure 1: Span type examples from our dataset. From top to bottom: a sentence with a continuous annotation;
a sentence with a discontinuous annotation; and a sentence with multiple discontinuous annotations that share
a common fragment.

Table 4: The concepts annotated in the Cadec corpus.

Tag       Description
Drug      A mention of a medicine or drug. Medicinal products and
          trade names are included, but not drug classes (such as
          NSAIDs).
ADR       Mentions of adverse drug reactions clearly associated with
          the drug referenced by the post.
Disease   A mention of a disease that is the reason for the patient
          taking the drug.
Symptom   A mention of a symptom that is the reason for the patient
          taking the drug.
Finding   Any other mention of a clinical finding that does not fit
          into the previous categories, for example, the mention of a
          disease that is not the reason for the patient taking the
          drug.

Table 5: Number of documents and span types in the
training and test sets.

                                        Training   Test   Total
Documents                               875        375    1250
Continuous spans                        5702       2350   8052
Discontinuous, non-overlapping spans    57         37     94
Discontinuous, overlapping spans        688        281    969
Total spans                             6447       2668   9115

6 Methods

There are existing tools that are capable of extracting
medical concepts from free text. One of these tools,
MetaMap, is used as a baseline for the evaluation. The
performance is expected to be poor mainly because
MetaMap was not designed to work with social media
text, which presents several challenges such as irregularities,
including misspellings, colloquial phrases, and
even novel phrases.


6.1 Dictionary-based Approaches 


As discussed in Section 3.1, most existing approaches
to ADR mining in social media use dictionary-based
techniques based on pattern matching rules or sliding
windows to identify drugs and adverse effects in noisy
text. However, these techniques have never been evaluated
independently of the overall task, nor been systematically
compared to each other under one setting.
This is in part because no standard testing set was
publicly available previously.


We implemented a method similar to the sliding window
approach used by Yang et al. [2012], but using the
Lucene search engine. The medical forum posts were
indexed and every post became a document. Then, a
controlled vocabulary was chosen and for each entry
a phrase search was executed. No stemming or stop
word removal was used, and tokenisation was done using
Lucene’s standard tokeniser, a grammar-based tokeniser
that implements the Word Break rules from the
Unicode Text Segmentation algorithm. Any matches
were transformed into spans. Figure 3 illustrates how
this approach works when CHV is used as the controlled
vocabulary. The process is the same when
other vocabularies are substituted.
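The phrase search step can be approximated without Lucene, as in the sketch below. It is not our Lucene-based implementation: a simple `\w+` regular expression stands in for Lucene's standard tokeniser, lowercasing of tokens is an assumption, and the function names and concept identifiers (`X1`, `X2`, `X3`) are made up for illustration.

```python
import re
from typing import List, Tuple

TOKEN = re.compile(r"\w+")  # rough stand-in for Lucene's standard tokeniser


def tokenise(text: str) -> List[Tuple[str, int, int]]:
    """Return (lowercased token, start offset, end offset) triples."""
    return [(m.group().lower(), m.start(), m.end())
            for m in TOKEN.finditer(text)]


def phrase_search(post: str,
                  vocabulary: List[Tuple[str, str]]) -> List[Tuple[int, int, str]]:
    """Find every vocabulary term that occurs as an exact token phrase in a post.

    `vocabulary` is a list of (concept_id, term) pairs. The result is a sorted
    list of (start, end, concept_id) character spans, end exclusive. No
    stemming or stop word removal is applied, mirroring the text above.
    """
    tokens = tokenise(post)
    words = [t[0] for t in tokens]
    spans = []
    for concept_id, term in vocabulary:
        phrase = [t[0] for t in tokenise(term)]
        n = len(phrase)
        if n == 0:
            continue
        for i in range(len(words) - n + 1):
            if words[i:i + n] == phrase:
                spans.append((tokens[i][1], tokens[i + n - 1][2], concept_id))
    return sorted(spans)


if __name__ == "__main__":
    post = "Severe stomach cramps, burping and nausea."
    vocab = [("X1", "stomach cramps"), ("X2", "nausea"), ("X3", "headache")]
    for start, end, concept_id in phrase_search(post, vocab):
        print(concept_id, repr(post[start:end]))
```

Scanning every term against every post is quadratic in the worst case; an inverted index such as the one Lucene builds avoids this by looking up only the posts that contain the first token of the phrase.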


Notice that the concept normalisation step is implicitly
being done when the spans are created. In the
event that two different concepts match the same identical
span, the system always selects the concept with
the lexicographically greater concept id.
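The tie-breaking rule can be illustrated with a short sketch. The function name and the concept identifiers are hypothetical, and the input is assumed to be a list of `(start, end, concept_id)` triples with string identifiers.

```python
from typing import Dict, Iterable, List, Tuple


def resolve_ties(spans: Iterable[Tuple[int, int, str]]) -> List[Tuple[int, int, str]]:
    """Keep one concept per identical span: the lexicographically greatest id.

    Concept ids are compared as strings, so "9" sorts after "10012".
    """
    best: Dict[Tuple[int, int], str] = {}
    for start, end, concept_id in spans:
        key = (start, end)
        if key not in best or concept_id > best[key]:
            best[key] = concept_id
    return sorted((s, e, c) for (s, e), c in best.items())


if __name__ == "__main__":
    candidates = [(7, 21, "10012"), (7, 21, "9"), (35, 41, "A")]
    print(resolve_ties(candidates))
```

The rule is deterministic but arbitrary with respect to meaning: because the comparison is on strings, a numerically smaller identifier such as `9` wins over `10012`.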
[Screenshot of the AskaPatient drug ratings page for VOLTAREN. Table columns: Rating, Reason, Side Effects for Voltaren, Comments, Sex, Age, Duration/Dosage, Date Added.]

Figure 2: A screenshot of AskaPatient forum posts on Voltaren.


6.2 Machine Learning Approaches 


Several machine learning approaches have been used
successfully to do entity recognition in natural language
text. For example, CRF classifiers have been used to
identify medical concepts in Electronic Health Records
(EHRs); however, social media text has very different
characteristics and is typically noisier. Even though
these techniques learn from the data, this does not
necessarily mean that their performance on social media
text will be comparable. To implement this approach, we
used the CRF classifier from the Stanford NER suite
[Finkel et al., 2005]
(http://nlp.stanford.edu/software/CRF-NER.shtml).

A CRF classifier takes as input different features that
are derived from the text. The features used in our
implementation are listed in Table 6.


Table 6: The features used in our CRF implementation.

Feature      | Description
-------------|------------------------------------------------------------
Bag of words | The words that surround each token.
N-grams      | Creates features from letter n-grams (substrings of the word). In this implementation n was set to 6.
Word shape   | Indicates the shape of the token, for example, if the token is composed of only lower case letters, upper case letters, or a combination.


One of the challenges of dealing with discontinuous
spans is representing them in a format that is suitable
as input to the classifier. Continuous spans are
typically represented using the standard Begin, Inside,
Outside (BIO) chunking representation common in most
NLP applications, which assigns a B to the first token
in a span, an I to all the other contiguous tokens in
the span, and an O to all other tokens that do not
belong to any span. This format does not support the
notion of discontinuous spans, and several solutions
have been proposed in previous research to overcome
this limitation. One of these is to treat discontinuous
spans as several continuous spans and, after
classification, use additional machine learning
techniques to correctly reassemble them. Another
alternative is to extend the BIO format with additional
tags to represent the discontinuous spans. The latter
approach has proved more successful in the CLEF and
SemEval tasks and therefore has been used in our
implementation.
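As a point of reference, the standard BIO representation for continuous spans can be sketched as follows. This is a minimal illustration, not code from the Stanford NER suite: the function names `to_bio` and `from_bio` are hypothetical, and spans are assumed to be given as inclusive (first token, last token) index pairs that do not overlap.

```python
from typing import List, Tuple


def to_bio(tokens: List[str], spans: List[Tuple[int, int]]) -> List[str]:
    """Encode continuous spans as one B/I/O tag per token."""
    tags = ["O"] * len(tokens)
    for first, last in spans:
        tags[first] = "B"
        for i in range(first + 1, last + 1):
            tags[i] = "I"
    return tags


def from_bio(tags: List[str]) -> List[Tuple[int, int]]:
    """Decode a BIO tag sequence back into (first, last) token index pairs."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag in ("B", "O") and start is not None:
            spans.append((start, i - 1))  # close the span that was open
            start = None
        if tag == "B":
            start = i
        # An "I" tag simply extends the span that is currently open.
    if start is not None:
        spans.append((start, len(tags) - 1))
    return spans
```

For the tokens `["caused", "severe", "stomach", "cramps", "and", "nausea"]` with spans `[(1, 3), (5, 5)]`, `to_bio` produces `O B I I O B`, and `from_bio` recovers the original two spans. A span made of tokens 1 and 3 only, skipping token 2, has no encoding in this scheme, which is the gap the extended format addresses.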

With the extended BIO format, the following additional
tags are introduced: D{B, I} and H{B, I}. The first set
of tags is used to represent discontinuous,
non-overlapping spans. The second set of tags is used to
represent discontinuous, overlapping spans that share
one or more tokens (the H stands for Head, as in head
word). Figure 4 shows an example of these types of
annotations and how they are represented in the extended
BIO format.


Figure 3: Diagram illustrating how the Lucene dictionary-based implementations work. (Step 4 in the diagram: use the positions of the phrase query in the matching documents to create the spans.)


Table 7: The number of true positive (TP), false
positive (FP), and false negative (FN) spans that are
created from the process of transforming the ground truth
spans into the extended BIO format and back. This is
equivalent to having a perfect classifier.

Set      | TP   | FP  | FN | Total
---------|------|-----|----|------
Training | 6325 | 122 | 66 | 6513
Test     | 2618 | 50  | 26 | 2694
Total    | 8943 | 172 | 92 | 9207

Notice, however, that there is an obvious limitation
with this approach: if several discontinuous spans occur
in the same sentence, then it is impossible to represent
them unambiguously. In order to determine how this
limitation might affect the performance of the CRF
approach in the social media dataset, a round trip
transformation was performed using the gold standard
annotations; that is, the gold standard was transformed
into the extended BIO format representation and then
back to the original format. This is equivalent to having
a perfect CRF classifier. Table 7 shows the number of
correct (TP), spurious (FP), and missed (FN) spans
created by the round trip process. In practice, the
limitations of this format do not have a significant
impact on the overall performance: across both sets,
8943 spans survive the round trip unchanged, against
172 false positives and 92 false negatives. Additional
techniques to deal with ambiguous cases were not pursued
and are left as future work.
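The round trip check can be illustrated with a small sketch. It is a simplification under loudly stated assumptions and not the decoding procedure used in the experiments: only the D{B, I} tags are handled (the H tags are left out), spans are tuples of token indices, and the decoder assumes that every D-tagged token in a sentence belongs to a single discontinuous span. All function names are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, ...]  # sorted token indices of one annotation


def fragments(span: Span) -> List[List[int]]:
    """Split a non-empty span into runs of consecutive token indices."""
    runs = [[span[0]]]
    for i in span[1:]:
        if i == runs[-1][-1] + 1:
            runs[-1].append(i)
        else:
            runs.append([i])
    return runs


def encode(n_tokens: int, spans: List[Span]) -> List[str]:
    """Continuous spans get B/I tags; discontinuous spans get DB/DI tags."""
    tags = ["O"] * n_tokens
    for span in spans:
        runs = fragments(span)
        prefix = "" if len(runs) == 1 else "D"
        for run in runs:
            tags[run[0]] = prefix + "B"
            for i in run[1:]:
                tags[i] = prefix + "I"
    return tags


def decode(tags: List[str]) -> Set[Span]:
    """Assumption: all D-tagged tokens in the sentence form one span."""
    spans: Set[Span] = set()
    current: List[int] = []        # continuous span currently open
    discontinuous: List[int] = []  # every D-tagged token seen so far
    for i, tag in enumerate(tags):
        if tag != "I" and current:
            spans.add(tuple(current))
            current = []
        if tag == "B":
            current = [i]
        elif tag == "I" and current:
            current.append(i)
        elif tag in ("DB", "DI"):
            discontinuous.append(i)
    if current:
        spans.add(tuple(current))
    if discontinuous:
        spans.add(tuple(discontinuous))
    return spans


def round_trip_counts(n_tokens: int, gold: List[Span]) -> Tuple[int, int, int]:
    """Return (TP, FP, FN) after encoding the gold spans and decoding them."""
    decoded = decode(encode(n_tokens, gold))
    gold_set = set(gold)
    return (len(gold_set & decoded),
            len(decoded - gold_set),
            len(gold_set - decoded))
```

With one continuous span `(0, 1)` and one discontinuous span `(3, 5)` in a seven-token sentence, the round trip is lossless and the counts are `(2, 0, 0)`. With two discontinuous spans `(0, 2)` and `(4, 6)` in the same sentence, this decoder merges them into a single span, giving `(0, 1, 2)`, which is the kind of ambiguity the paragraph above describes.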

One of the differences between this approach and the
dictionary-based approaches is that the CRF classifier
only identifies the spans that refer to drugs or ADRs
and does not map them to the corresponding concepts.
Therefore, the second part of the task has to be
implemented independently.

Two approaches were explored. The first one is based
on a traditional search method using the Vector Space
Model (VSM). The Lucene search engine was used for
this purpose. The target ontology was indexed by
creating a document for each term and storing the
corresponding concept id. This means that a concept with
multiple synonyms generates multiple documents in the
index. In this case, stemming and stop word removal
were used. Then, the text of each span was used to
query the index. When the span included multiple
tokens, the query was not required to match all of them.
The top ranked concept was assigned to the span and,
if the query returned no results, the span was
annotated as concept-less.
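The idea of indexing one document per term and ranking by vector-space similarity can be sketched with a plain TF-IDF and cosine similarity model. This is a simplified stand-in, not Lucene's scoring function: the class `TermIndex`, its weighting formula, and the concept ids in the usage note are hypothetical, and stemming and stop word removal are omitted for brevity even though the implementation described above used them.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Tuple


def tokenise(text: str) -> List[str]:
    return text.lower().split()


class TermIndex:
    """One indexed 'document' per ontology term, each storing a concept id."""

    def __init__(self, terms: List[Tuple[str, str]]):
        # `terms` is a list of (term_text, concept_id) pairs.
        self.docs = [(Counter(tokenise(text)), cid) for text, cid in terms]
        df: Counter = Counter()
        for counts, _ in self.docs:
            df.update(counts.keys())
        n = len(self.docs)
        self.idf: Dict[str, float] = {t: math.log(1 + n / d) for t, d in df.items()}

    def _vector(self, counts: Counter) -> Dict[str, float]:
        return {t: c * self.idf[t] for t, c in counts.items() if t in self.idf}

    def normalise(self, span_text: str) -> Optional[str]:
        """Return the concept id of the top-ranked term, or None (concept-less)."""
        query = self._vector(Counter(tokenise(span_text)))
        best_score, best_cid = 0.0, None
        for counts, cid in self.docs:
            doc = self._vector(counts)
            # Query tokens are optional: any shared token contributes.
            dot = sum(w * doc.get(t, 0.0) for t, w in query.items())
            if dot == 0.0:
                continue
            norm = (math.sqrt(sum(w * w for w in query.values()))
                    * math.sqrt(sum(w * w for w in doc.values())))
            score = dot / norm
            if score > best_score:
                best_score, best_cid = score, cid
        return best_cid
```

With the hypothetical terms `[("nausea", "X1"), ("abdominal pain", "X2"), ("stomach pain", "X2"), ("weight loss", "X3")]`, the span text "severe stomach pain" is mapped to `X2`, "lost weight" is mapped to `X3` through its single shared token, and "insomnia" returns `None`, which corresponds to a concept-less span.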

The second approach uses Ontoserver [McBride et al.,
2012], a terminology server developed at the Australian
e-Health Research Centre that, given a free-text query,
returns the most relevant SNOMED CT and AMT concepts.
Ontoserver uses a purpose-tuned retrieval function based
on a multi-prefix matching algorithm [Sevenster et al.,
2012]. It also supports other features such as spell
checking and filtering based on hierarchies in the
ontology. We used version 2.3.0 of Ontoserver, which is
publicly available at http://ontoserver.csiro.au:8080/.
The text in each span was used as a query. The
parameters were set so that not all terms were required
to match and, when dealing with SNOMED CT, the results
were filtered so that only those concepts that belong to
the Clinical Finding hierarchy were returned. When a
query returned no results, the span was annotated as
concept-less.

A summary of all the methods that were implemented is
shown in Table 8.


Figure 4: Examples of how annotation types are represented using the extended BIO format. The O annotations
are not shown. The figure annotates three example sentences:
1. Next time I’ll try my luck with Paracetamol.
2. The pill I took consisted of 50 MG Diclofenac and 200 MG Misoprostol.
3. it has left me feeling exausted, and depressed


Table 8: A summary of the different methods that were
evaluated.

Method           | Description
-----------------|------------------------------------------------------------
MetaMap          | The baseline method. MetaMap was used to identify and normalise the concepts in the social media text.
VSM + UMLS       | A dictionary-based approach based on a sliding window that uses the UMLS as the underlying controlled vocabulary.
VSM + CHV        | A dictionary-based approach based on a sliding window that uses CHV, a list of colloquial health terms.
VSM + SCT        | A dictionary-based approach based on a sliding window that uses SNOMED CT as the underlying controlled vocabulary. This implementation is used to identify ADRs.
VSM + AMT        | A dictionary-based approach based on a sliding window that uses AMT as the underlying controlled vocabulary. This implementation is used to identify drugs.
CRF + VSM        | A mixed approach that uses a CRF classifier to identify the concept spans and a VSM implementation to map these spans to concepts in a controlled vocabulary (SNOMED CT for ADRs and AMT for drugs).
CRF + Ontoserver | A mixed approach that uses a CRF classifier to identify the concept spans and Ontoserver to map these spans to concepts in a controlled vocabulary.


6.3 Statistical Significance


To determine if the improvements obtained with any
two different methods were statistically significant, we
used McNemar’s test [Davis et al., 2012]. This test is
applied to paired nominal data using a 2 x 2 contingency
table to determine if row and column marginal
frequencies are equal. The contingency table is shown in
Table 9, where A is the number of correct predictions by
both methods; B is the number of correct predictions
by Method 1 where Method 2 produced an incorrect
prediction; C is the number of correct predictions by
Method 2 where Method 1 produced an incorrect
prediction; and D is the number of incorrect predictions
by both methods.


Table 9: The contingency table used as input to
McNemar’s test, used to test statistical significance.

                 | Method 2 correct | Method 2 wrong
-----------------|------------------|---------------
Method 1 correct | A                | B
Method 1 wrong   | C                | D
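McNemar's test only depends on the two discordant counts B and C. A minimal sketch of the computation is given below; it uses the chi-square approximation with one degree of freedom, and the continuity correction is an assumption, since the text does not say which variant of the test was applied. The function name `mcnemar` and the counts in the usage note are hypothetical.

```python
import math
from typing import Tuple


def mcnemar(b: int, c: int, continuity: bool = True) -> Tuple[float, float]:
    """Return (chi-square statistic, p-value) from the discordant counts B and C."""
    if b + c == 0:
        return 0.0, 1.0  # the two methods never disagree
    diff = abs(b - c)
    if continuity:
        diff = max(diff - 1, 0)  # continuity correction (an assumption here)
    statistic = diff ** 2 / (b + c)
    # Survival function of the chi-square distribution with one degree
    # of freedom, written in terms of the complementary error function.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value
```

For instance, with made-up counts B = 40 and C = 10 the corrected statistic is (|40 - 10| - 1)^2 / 50 = 16.82, which is significant at the p < 0.01 level used in the tables, whereas B = C = 10 gives a statistic of 0 and a p-value of 1.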


7 Results and Discussion 


The results of the concept identification task are shown
in Table 10. There are several noteworthy results.
First, the CRF implementation outperforms MetaMap
and all the dictionary-based implementations in almost
all of the metrics that were considered, in both strict
and relaxed modes; the exception is recall on drugs,
where VSM+UMLS and VSM+CHV score higher. Also, notice
that in some cases the overall ranking provided by the
F-Score value is different from the ranking provided by
the accuracy value. In particular, when dealing with ADR
identification in strict mode, the MetaMap
implementation has a higher accuracy than the VSM+UMLS
implementation despite its precision, recall and F-Score
being much lower. This happens because the VSM+UMLS
implementation, despite producing more correct spans
than the MetaMap implementation, also produces many more
incorrect spans.
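The F-Score column can be checked against the precision and recall columns. The sketch below assumes that the reported F-Score is the balanced harmonic mean of precision and recall (the definition is not restated in this section, although the strict ADR figures in Table 10 are consistent with it); the function name `f_score` is illustrative.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-Score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-Score) for strict ADR identification, Table 10.
rows = {
    "VSM+UMLS": (0.264, 0.392, 0.316),
    "MetaMap": (0.105, 0.080, 0.091),
    "CRF": (0.644, 0.565, 0.602),
}

for method, (p, r, reported) in rows.items():
    print(method, round(f_score(p, r), 3), reported)
```

Because the harmonic mean is dominated by the smaller of the two values, MetaMap's low precision and recall give it a very low F-Score even though its accuracy is higher than that of VSM+UMLS.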

The task of identifying drugs is considerably different
from the task of identifying ADRs because it usually
involves less ambiguity. For example, trade products
usually have no synonyms, which limits the number of
ways a person can refer to them (this of course does
not rule out misspellings, which are common in social
media). Because of this, intuitively, this task should
be easier than the task of identifying ADRs. The
results show that the CRF implementation indeed performs
better in this task than in the ADR identification
task. Note also that MetaMap obtains very low precision
and recall. This is because the tool was not designed
to identify drugs. Also, most of the dictionary-based
implementations achieve good recall but low precision;
this is likely due to some of the constraints in the
annotation guidelines, for example, the one indicating
that drug classes should be excluded. If the drug
classes are mentioned frequently and are part of the
underlying controlled vocabularies then this will create
many false positives. In contrast, the CRF
implementation is capable of identifying some of the
common drug classes that are not annotated in the
training set and is able to avoid creating false
positives in most cases.


Table 10: Evaluation results of the concept identification task, sorted by accuracy. A statistically significant
difference with the next best performing method is indicated with * (p < 0.01).

Entities | Type    | Method   | Precision | Recall | F-Score | Accuracy
---------|---------|----------|-----------|--------|---------|---------
ADRs     | Strict  | VSM+UMLS | 0.264     | 0.392  | 0.316   | 0.454
ADRs     | Strict  | MetaMap  | 0.105     | 0.080  | 0.091   | 0.485*
ADRs     | Strict  | VSM+CHV  | 0.457     | 0.370  | 0.409   | 0.656*
ADRs     | Strict  | VSM+SCT  | 0.498     | 0.352  | 0.412   | 0.678*
ADRs     | Strict  | CRF      | 0.644     | 0.565  | 0.602   | 0.760*
ADRs     | Relaxed | VSM+UMLS | 0.454     | 0.674  | 0.543   | 0.635
ADRs     | Relaxed | VSM+CHV  | 0.747     | 0.605  | 0.669   | 0.807*
ADRs     | Relaxed | MetaMap  | 0.794     | 0.605  | 0.687   | 0.822*
ADRs     | Relaxed | VSM+SCT  | 0.818     | 0.578  | 0.677   | 0.822
ADRs     | Relaxed | CRF      | 0.908     | 0.797  | 0.849   | 0.909*
Drugs    | Strict  | VSM+UMLS | 0.160     | 0.882  | 0.271   | 0.546
Drugs    | Strict  | VSM+AMT  | 0.160     | 0.775  | 0.266   | 0.589*
Drugs    | Strict  | MetaMap  | 0.022     | 0.021  | 0.021   | 0.816*
Drugs    | Strict  | VSM+CHV  | 0.468     | 0.856  | 0.605   | 0.893*
Drugs    | Strict  | CRF      | 0.943     | 0.840  | 0.889   | 0.980*
Drugs    | Relaxed | VSM+UMLS | 0.168     | 0.923  | 0.284   | 0.554
Drugs    | Relaxed | VSM+AMT  | 0.173     | 0.837  | 0.287   | 0.601*
Drugs    | Relaxed | MetaMap  | 0.145     | 0.139  | 0.142   | 0.839*
Drugs    | Relaxed | VSM+CHV  | 0.489     | 0.893  | 0.632   | 0.900*
Drugs    | Relaxed | CRF      | 0.979     | 0.872  | 0.923   | 0.986*


Table 11 shows the results of the concept normalisation
task. In this case the strict metric is more relevant,
because some implementations can achieve a very high
score in the relaxed version despite having a very poor
overall performance. The results show that Ontoserver
outperforms the other approaches. Overall, however,
the results are quite poor: the best strict
effectiveness for ADRs is only 0.376. This highlights
two important aspects of the task. First, it is
inherently difficult to map colloquial language to
ontologies that contain more formal terms. Second,
because in this task the goal is to map the spans to
SNOMED CT concepts, the quality of the results when
using approaches that rely on other controlled
vocabularies will depend on the quality of the mappings
between those vocabularies and SNOMED CT. For example,
when using the VSM+CHV implementation, even if the term
in the text appears in CHV, if this term is mapped to an
incorrect concept in SNOMED CT, the implementation
will produce an incorrect result. Even though this issue
has not been explored in depth, some of the potential
problems include mappings to concepts that are now
inactive (and therefore will never appear in the gold
standard) and mappings to concepts in other versions
of SNOMED CT (for example SNOMED US, which
shares a common subset with SNOMED AU but also
includes some local concepts that will not appear in the
Australian version).

For example, in the MetaMap implementation the
concept 366981002 (Pain) is returned as the top concept
for several spans, and this concept has been replaced
in the current version with 22253000 (Pain). It
may be possible to automatically replace an inactive
concept with the current concept that replaced it;
however, this option was not attempted and is left as
future work.

It was also expected that the different methods would
perform better when normalising drugs than when
normalising ADRs. For most implementations this turned
out to be true, except for the dictionary-based methods
that are not based on AMT. These methods were
unable to normalise any concepts at all because a map
between the other controlled vocabularies and AMT
does not currently exist.

Finally, a strict evaluation of the full task was
carried out, where a span was only considered correct if
it matched the gold standard span exactly and was
annotated with the same concept. This evaluation is
important because a good free text annotation system will
not only need to identify relevant spans but also
annotate them correctly. The results for the full
evaluation are shown in Table 12. The best performing
system overall was the CRF implementation using
Ontoserver for concept normalisation.

Table 11: Results of the evaluation of the concept
normalisation task. Baseline is MetaMap.

Entities | Type    | Method         | Effectiveness
---------|---------|----------------|--------------
ADRs     | Strict  | MetaMap        | 0.029
ADRs     | Strict  | VSM+UMLS       | 0.105
ADRs     | Strict  | VSM+CHV        | 0.106
ADRs     | Strict  | CRF+VSM        | 0.327
ADRs     | Strict  | VSM+SCT        | 0.332
ADRs     | Strict  | CRF+Ontoserver | 0.376
ADRs     | Relaxed | MetaMap        | 0.363
ADRs     | Relaxed | VSM+UMLS       | 0.266
ADRs     | Relaxed | VSM+CHV        | 0.287
ADRs     | Relaxed | CRF+VSM        | 0.578
ADRs     | Relaxed | VSM+SCT        | 0.943
ADRs     | Relaxed | CRF+Ontoserver | 0.666
Drugs    | Strict  | MetaMap        | 0.000
Drugs    | Strict  | VSM+UMLS       | 0.000
Drugs    | Strict  | VSM+CHV        | 0.000
Drugs    | Strict  | CRF+VSM        | 0.749
Drugs    | Strict  | CRF+Ontoserver | 0.773
Drugs    | Strict  | VSM+AMT        | 0.758
Drugs    | Relaxed | MetaMap        | 0.000
Drugs    | Relaxed | VSM+UMLS       | 0.000
Drugs    | Relaxed | VSM+CHV        | 0.000
Drugs    | Relaxed | CRF+VSM        | 0.891
Drugs    | Relaxed | CRF+Ontoserver | 0.920
Drugs    | Relaxed | VSM+AMT        | 0.978


Table 12: Results of the evaluation of the full task applied to ADRs and drugs, sorted by accuracy. A statistically
significant difference with the next best performing method is indicated with * (p < 0.01).

Entities | Method         | Precision | Recall | F-Score | Accuracy
---------|----------------|-----------|--------|---------|---------
ADRs     | VSM+UMLS       | 0.088     | 0.104  | 0.095   | 0.363
ADRs     | MetaMap        | 0.041     | 0.029  | 0.034   | 0.468*
ADRs     | VSM+CHV        | 0.218     | 0.106  | 0.143   | 0.590*
ADRs     | CRF+VSM        | 0.564     | 0.327  | 0.414   | 0.702*
ADRs     | VSM+AMT        | 0.572     | 0.332  | 0.420   | 0.706
ADRs     | CRF+Ontoserver | 0.771     | 0.376  | 0.506   | 0.764*
Drugs    | VSM+UMLS       | 0.000     | 0.000  | 0.000   | 0.461
Drugs    | VSM+AMT        | 0.163     | 0.758  | 0.269   | 0.605*
Drugs    | VSM+CHV        | 0.000     | 0.000  | 0.000   | 0.814*
Drugs    | MetaMap        | 0.000     | 0.000  | 0.000   | 0.814
Drugs    | CRF+VSM        | 0.988     | 0.749  | 0.852   | 0.975*
Drugs    | CRF+Ontoserver | 0.988     | 0.773  | 0.867   | 0.977


8 Conclusions and Future Work

Pharmacovigilance has moved past the era in which it
relied only on manual reports of potential adverse drug
effects. Actively detecting signals of adverse drug
reactions by automatically text mining consumer reviews
is one of the emerging areas.

We conducted an empirical evaluation of different
methods to automatically identify and normalise medical
concepts in the domain of adverse drug reaction
detection in medical forums. It included several methods
commonly used in the ADR mining literature, as
well as state-of-the-art machine learning methods that
have been used in other domains. To our knowledge,
this is the first study to systematically compare the
most common concept identification and normalisation
approaches used in adverse effect mining from social
media under a controlled setting. Concept identification
and normalisation is an important step in ADR signal
detection, one that determines the effectiveness of
automated systems in this domain.

The experimental results showed that the CRF
implementation combined with Ontoserver outperformed
all the other methods that were evaluated, including
MetaMap and the dictionary-based methods. We believe
that the availability of the new Cadec corpus and
the empirical results shown in this paper will benefit
other researchers working on ADR mining methods.

In the future, we plan to improve the CRF method
with additional features, specifically domain-specific
features that are likely to improve recognition of ADRs
in text.

Regarding the concept normalisation task, the results
showed that there is still room for improvement.
There are two avenues to explore. First, the concept
normalisation could be evaluated completely
independently by using the spans in the gold standard as
input. Second, to the best of our knowledge, existing
concept normalisation implementations, including the
ones implemented in this work, do not make use of the
context of the spans. We believe more advanced methods
may benefit from having access not only to the text in
the span but also to the surrounding tokens and
previously identified concepts.


Regarding the evaluation, there are several ways that
the current methods could be extended. In many use
cases, generating the exact same span as in the gold
standard is not necessary. However, the current
definition of a relaxed match is too loose and might not
work appropriately in certain situations. For example,
when spans tend to be long and include multiple
tokens, a single token overlap constitutes a positive
relaxed match. An improvement over the relaxed matching
criteria would be to consider the extent of the match
by establishing a threshold based on the ratio of
either characters or tokens that are required for the
overlap to be considered a valid match. Using a high
threshold would ensure that the systems under this
relaxed evaluation are only producing spans with minimal
differences (such as including prepositions before a
noun or adjacent punctuation symbols, for example).
Several metrics that could be adapted for this scenario
have been proposed in the area of passage retrieval
[Wade and Allan, 2005].
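One possible form of such a thresholded relaxed match is sketched below, using character offsets. The choice of dividing the overlap by the length of the longer span is an assumption made for illustration, as are the function names; other ratios, such as a token-based one, would fit the description equally well.

```python
from typing import Tuple

Span = Tuple[int, int]  # character offsets, end exclusive


def overlap_ratio(pred: Span, gold: Span) -> float:
    """Overlapping characters divided by the length of the longer span."""
    overlap = max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    longest = max(pred[1] - pred[0], gold[1] - gold[0])
    return overlap / longest if longest else 0.0


def relaxed_match(pred: Span, gold: Span, threshold: float = 0.0) -> bool:
    """With threshold 0.0 any overlap counts, as in the current relaxed mode."""
    ratio = overlap_ratio(pred, gold)
    return ratio > 0 and ratio >= threshold
```

With a 30-character gold span `(0, 30)`, a predicted span `(0, 6)` counts as a relaxed match under the current definition but is rejected with a threshold of 0.8, while a predicted span `(3, 30)`, which only drops a short leading word, is still accepted.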


When evaluating the concept normalisation task, the
current evaluation method only considers a span to be
correct if it is assigned the same concept found in the
gold standard. However, considering that the
annotations in the Cadec corpus come from an ontology, if
the span is annotated with a concept that is very close
to the concept in the gold standard, say a parent
concept, then considering the span to be completely wrong
seems too severe. Modifying the evaluation metric to
consider this ‘semantic distance’ will likely give a better
sense of the performance of the systems under
evaluation.
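One way such a metric could work is to give partial credit that decays with the path length between the predicted and the gold concept. The sketch below is purely illustrative: the toy hierarchy, the single-parent simplification (SNOMED CT concepts can have several parents), the scoring formula 1 / (1 + distance), and all function names are assumptions rather than a proposal from the evaluation above.

```python
from typing import Dict, List, Optional


def ancestors(concept: str, parent: Dict[str, str]) -> List[str]:
    """Path from a concept up to the root, following single-parent links."""
    path = [concept]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def semantic_distance(a: str, b: str, parent: Dict[str, str]) -> Optional[int]:
    """Number of edges between a and b via their closest common ancestor."""
    up_a = ancestors(a, parent)
    depth_b = {c: i for i, c in enumerate(ancestors(b, parent))}
    for i, c in enumerate(up_a):
        if c in depth_b:
            return i + depth_b[c]
    return None  # the two concepts share no ancestor


def partial_credit(predicted: str, gold: str, parent: Dict[str, str]) -> float:
    """1.0 for an exact match, decreasing as the concepts move apart."""
    distance = semantic_distance(predicted, gold, parent)
    return 0.0 if distance is None else 1.0 / (1 + distance)
```

In a toy hierarchy where "stomach pain" is a child of "abdominal pain", which is a child of "pain", predicting "abdominal pain" for a gold annotation of "stomach pain" scores 0.5 instead of 0, while an unrelated sibling branch such as "headache" scores 0.25.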
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